import sqlite3
import pandas as pdIntroduction
Structured Query Language (SQL) is a powerful tool for managing and manipulating relational databases. It is essential for data scientists, analysts, and anyone working with large datasets. This chapter will explore the importance of SQL, its applications, and provide example code format to illustrate its utility.
Why SQL is Important
- Data Management: SQL allows for efficient management of large volumes of data. It provides the means to create, read, update, and delete data in a relational database.
- Data Retrieval: With SQL, you can perform complex queries to retrieve specific data from one or more tables, making it easier to analyze and draw insights.
- Data Manipulation: SQL enables the manipulation of data through operations such as sorting, filtering, and aggregating. This is crucial for data cleaning and preprocessing.
- Data Integration: SQL supports the integration of data from different sources, allowing for comprehensive data analysis.
- Standardization: SQL is a standardized language used by most relational database management systems (RDBMS), making it a versatile and essential skill for professionals in the field.
Basic SQL Concepts
Basic SQL Concepts
To illustrate the following SQL concepts, we will use the employees table with the following data:
Table: employees
| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 1 | Alice | Smith | HR | 60000 |
| 2 | Bob | Johnson | IT | 80000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 4 | David | Kim | HR | 75000 |
| 5 | Eva | Brown | IT | 65000 |
| 6 | Frank | Wilson | Sales | 70000 |
| 7 | Grace | Taylor | HR | 62000 |
| 8 | Henry | Anderson | IT | 77000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
| 11 | Karen | Harris | IT | 69000 |
| 12 | Leo | Martin | Sales | 50000 |
| 13 | Mia | Jackson | HR | 64000 |
| 14 | Noah | Lee | IT | 72000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 16 | Paul | Young | HR | 61000 |
| 17 | Quinn | King | IT | 76000 |
| 18 | Rachel | Scott | Sales | 57000 |
| 19 | Sam | Green | HR | 63000 |
| 20 | Tina | Adams | IT | 81000 |
SELECT and FROM
The SELECT statement is used to fetch data from a database, and the FROM clause specifies the table.
-- Selecting all columns from a table
p = """
SELECT *
FROM
employees;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 1 | Alice | Smith | HR | 60000 |
| 2 | Bob | Johnson | IT | 80000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 4 | David | Kim | HR | 75000 |
| 5 | Eva | Brown | IT | 65000 |
| 6 | Frank | Wilson | Sales | 70000 |
| 7 | Grace | Taylor | HR | 62000 |
| 8 | Henry | Anderson | IT | 77000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
| 11 | Karen | Harris | IT | 69000 |
| 12 | Leo | Martin | Sales | 50000 |
| 13 | Mia | Jackson | HR | 64000 |
| 14 | Noah | Lee | IT | 72000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 16 | Paul | Young | HR | 61000 |
| 17 | Quinn | King | IT | 76000 |
| 18 | Rachel | Scott | Sales | 57000 |
| 19 | Sam | Green | HR | 63000 |
| 20 | Tina | Adams | IT | 81000 |
-- Selecting specific columns
p = """
SELECT
first_name,
last_name,
salary
FROM
employees;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | salary |
|---|---|---|---|
| 1 | Alice | Smith | 60000 |
| 2 | Bob | Johnson | 80000 |
| 3 | Charlie | Lee | 55000 |
| 4 | David | Kim | 75000 |
| 5 | Eva | Brown | 65000 |
| 6 | Frank | Wilson | 70000 |
| 7 | Grace | Taylor | 62000 |
| 8 | Henry | Anderson | 77000 |
| 9 | Irene | Thomas | 53000 |
| 10 | Jack | White | 58000 |
| 11 | Karen | Harris | 69000 |
| 12 | Leo | Martin | 50000 |
| 13 | Mia | Jackson | 64000 |
| 14 | Noah | Lee | 72000 |
| 15 | Olivia | Perez | 68000 |
| 16 | Paul | Young | 61000 |
| 17 | Quinn | King | 76000 |
| 18 | Rachel | Scott | 57000 |
| 19 | Sam | Green | 63000 |
| 20 | Tina | Adams | 81000 |
SELECT EXCLUDE and RENAME
You can exclude columns using SELECT and rename them for clarity.
-- Selecting all but one column
p = """
SELECT *
EXCLUDE
salary
FROM
employees;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department |
|---|---|---|---|
| 1 | Alice | Smith | HR |
| 2 | Bob | Johnson | IT |
| 3 | Charlie | Lee | Sales |
| 4 | David | Kim | HR |
| 5 | Eva | Brown | IT |
| 6 | Frank | Wilson | Sales |
| 7 | Grace | Taylor | HR |
| 8 | Henry | Anderson | IT |
| 9 | Irene | Thomas | Sales |
| 10 | Jack | White | HR |
| 11 | Karen | Harris | IT |
| 12 | Leo | Martin | Sales |
| 13 | Mia | Jackson | HR |
| 14 | Noah | Lee | IT |
| 15 | Olivia | Perez | Sales |
| 16 | Paul | Young | HR |
| 17 | Quinn | King | IT |
| 18 | Rachel | Scott | Sales |
| 19 | Sam | Green | HR |
| 20 | Tina | Adams | IT |
-- Renaming columns
p = """
SELECT
first_name AS fname,
last_name AS lname
FROM
employees;
"""
pd.read_sql_query(p, con)| fname | lname |
|---|---|
| Alice | Smith |
| Bob | Johnson |
| Charlie | Lee |
| David | Kim |
| Eva | Brown |
| Frank | Wilson |
| Grace | Taylor |
| Henry | Anderson |
| Irene | Thomas |
| Jack | White |
| Karen | Harris |
| Leo | Martin |
| Mia | Jackson |
| Noah | Lee |
| Olivia | Perez |
| Paul | Young |
| Quinn | King |
| Rachel | Scott |
| Sam | Green |
| Tina | Adams |
LIMIT and OFFSET
The LIMIT clause restricts the number of rows returned, and OFFSET skips rows before beginning to return rows.
-- Limiting the number of rows returned
p = """
SELECT *
FROM
employees
LIMIT
10;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 1 | Alice | Smith | HR | 60000 |
| 2 | Bob | Johnson | IT | 80000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 4 | David | Kim | HR | 75000 |
| 5 | Eva | Brown | IT | 65000 |
| 6 | Frank | Wilson | Sales | 70000 |
| 7 | Grace | Taylor | HR | 62000 |
| 8 | Henry | Anderson | IT | 77000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
-- Skipping rows
p = """
SELECT *
FROM
employees
LIMIT 10
OFFSET 5;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 6 | Frank | Wilson | Sales | 70000 |
| 7 | Grace | Taylor | HR | 62000 |
| 8 | Henry | Anderson | IT | 77000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
| 11 | Karen | Harris | IT | 69000 |
| 12 | Leo | Martin | Sales | 50000 |
| 13 | Mia | Jackson | HR | 64000 |
| 14 | Noah | Lee | IT | 72000 |
| 15 | Olivia | Perez | Sales | 68000 |
ORDER BY
The ORDER BY clause sorts the result set.
-- Sorting the result set by salary in ascending order
p = """
SELECT *
FROM
employees
ORDER BY
salary;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 12 | Leo | Martin | Sales | 50000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 10 | Jack | White | HR | 58000 |
| 1 | Alice | Smith | HR | 60000 |
| 19 | Sam | Green | HR | 63000 |
| 7 | Grace | Taylor | HR | 62000 |
| 13 | Mia | Jackson | HR | 64000 |
| 5 | Eva | Brown | IT | 65000 |
| 11 | Karen | Harris | IT | 69000 |
| 18 | Rachel | Scott | Sales | 57000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 6 | Frank | Wilson | Sales | 70000 |
| 14 | Noah | Lee | IT | 72000 |
| 17 | Quinn | King | IT | 76000 |
| 8 | Henry | Anderson | IT | 77000 |
| 4 | David | Kim | HR | 75000 |
| 2 | Bob | Johnson | IT | 80000 |
| 20 | Tina | Adams | IT | 81000 |
| 16 | Paul | Young | HR | 61000 |
-- Sorting in descending order
p = """
SELECT *
FROM
employees
ORDER BY
salary DESC;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 20 | Tina | Adams | IT | 81000 |
| 2 | Bob | Johnson | IT | 80000 |
| 8 | Henry | Anderson | IT | 77000 |
| 17 | Quinn | King | IT | 76000 |
| 4 | David | Kim | HR | 75000 |
| 14 | Noah | Lee | IT | 72000 |
| 6 | Frank | Wilson | Sales | 70000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 11 | Karen | Harris | IT | 69000 |
| 5 | Eva | Brown | IT | 65000 |
| 13 | Mia | Jackson | HR | 64000 |
| 19 | Sam | Green | HR | 63000 |
| 7 | Grace | Taylor | HR | 62000 |
| 16 | Paul | Young | HR | 61000 |
| 1 | Alice | Smith | HR | 60000 |
| 10 | Jack | White | HR | 58000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 18 | Rachel | Scott | Sales | 57000 |
| 12 | Leo | Martin | Sales | 50000 |
AND, OR, NOT
Logical operators filter records based on multiple conditions.
-- Using AND, OR, NOT operators
p = """
SELECT *
FROM
employees
WHERE
department = 'Sales' AND salary > 50000;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 6 | Frank | Wilson | Sales | 70000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 18 | Rachel | Scott | Sales | 57000 |
Numeric Operations
Perform arithmetic operations in SQL.
-- Calculating a new column
p = """
SELECT
first_name,
last_name,
salary,
salary * 1.1 AS new_salary
FROM
employees;
"""
pd.read_sql_query(p, con)| first_name | last_name | salary | new_salary |
|---|---|---|---|
| Alice | Smith | 60000 | 66000.0 |
| Bob | Johnson | 80000 | 88000.0 |
| Charlie | Lee | 55000 | 60500.0 |
| David | Kim | 75000 | 82500.0 |
| Eva | Brown | 65000 | 71500.0 |
| Frank | Wilson | 70000 | 77000.0 |
| Grace | Taylor | 62000 | 68200.0 |
| Henry | Anderson | 77000 | 84700.0 |
| Irene | Thomas | 53000 | 58300.0 |
| Jack | White | 58000 | 63800.0 |
| Karen | Harris | 69000 | 75900.0 |
| Leo | Martin | 50000 | 55000.0 |
| Mia | Jackson | 64000 | 70400.0 |
| Noah | Lee | 72000 | 79200.0 |
| Olivia | Perez | 68000 | 74800.0 |
| Paul | Young | 61000 | 67100.0 |
| Quinn | King | 76000 | 83600.0 |
| Rachel | Scott | 57000 | 62700.0 |
| Sam | Green | 63000 | 69300.0 |
| Tina | Adams | 81000 | 89100.0 |
LIKE and NOT LIKE
Pattern matching using LIKE.
-- Pattern matching
p = """
SELECT *
FROM
employees
WHERE
last_name
LIKE 'S%';
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 1 | Alice | Smith | HR | 60000 |
| 18 | Rachel | Scott | Sales | 57000 |
BETWEEN
Range filtering using BETWEEN.
-- Filtering within a range
p = """
SELECT *
FROM
employees
WHERE
salary BETWEEN 40000 AND 60000;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 12 | Leo | Martin | Sales | 50000 |
| 3 | Charlie | Lee | Sales | 55000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
| 1 | Alice | Smith | HR | 60000 |
OFFSET
Skip a specific number of rows before starting to return rows.
-- Skipping the first 5 rows
p = """
SELECT *
FROM
employees
OFFSET 5;
"""
pd.read_sql_query(p, con)| id | first_name | last_name | department | salary |
|---|---|---|---|---|
| 6 | Frank | Wilson | Sales | 70000 |
| 7 | Grace | Taylor | HR | 62000 |
| 8 | Henry | Anderson | IT | 77000 |
| 9 | Irene | Thomas | Sales | 53000 |
| 10 | Jack | White | HR | 58000 |
| 11 | Karen | Harris | IT | 69000 |
| 12 | Leo | Martin | Sales | 50000 |
| 13 | Mia | Jackson | HR | 64000 |
| 14 | Noah | Lee | IT | 72000 |
| 15 | Olivia | Perez | Sales | 68000 |
| 16 | Paul | Young | HR | 61000 |
| 17 | Quinn | King | IT | 76000 |
| 18 | Rachel | Scott | Sales | 57000 |
| 19 | Sam | Green | HR | 63000 |
| 20 | Tina | Adams | IT | 81000 |
Intermediate SQL Concepts
Joins
Table: employees
| id | first_name | last_name | department_id | salary |
|---|---|---|---|---|
| 1 | Alice | Smith | 1 | 60000 |
| 2 | Bob | Johnson | 2 | 80000 |
| 3 | Charlie | Lee | 3 | 55000 |
| 4 | David | Kim | 1 | 75000 |
| 5 | Eva | Brown | 2 | 65000 |
| 6 | Frank | Wilson | 3 | 70000 |
| 7 | Grace | Taylor | 1 | 62000 |
| 8 | Henry | Anderson | 2 | 77000 |
| 9 | Irene | Thomas | 3 | 53000 |
| 10 | Jack | White | 1 | 58000 |
| 11 | Karen | Harris | 2 | 69000 |
| 12 | Leo | Martin | 3 | 50000 |
| 13 | Mia | Jackson | 1 | 64000 |
| 14 | Noah | Lee | 2 | 72000 |
| 15 | Olivia | Perez | 3 | 68000 |
| 16 | Paul | Young | 1 | 61000 |
| 17 | Quinn | King | 2 | 76000 |
| 18 | Rachel | Scott | 3 | 57000 |
| 19 | Sam | Green | 1 | 63000 |
| 20 | Tina | Adams | 2 | 81000 |
Table: departments
| department_id | department_name |
|---|---|
| 1 | HR |
| 2 | IT |
| 3 | Sales |
Combine rows from two or more tables based on a related column.
-- Inner join example
p = """
SELECT
employees.first_name,
employees.last_name,
departments.department_name
FROM
employees
INNER JOIN
departments ON employees.department_id = departments.department_id;
"""
pd.read_sql_query(p, con)| first_name | last_name | department_name |
|---|---|---|
| Alice | Smith | HR |
| David | Kim | HR |
| Grace | Taylor | HR |
| Jack | White | HR |
| Mia | Jackson | HR |
| Paul | Young | HR |
| Sam | Green | HR |
| Bob | Johnson | IT |
| Eva | Brown | IT |
| Henry | Anderson | IT |
| Karen | Harris | IT |
| Noah | Lee | IT |
| Quinn | King | IT |
| Tina | Adams | IT |
| Charlie | Lee | Sales |
| Frank | Wilson | Sales |
| Irene | Thomas | Sales |
| Leo | Martin | Sales |
| Olivia | Perez | Sales |
| Rachel | Scott | Sales |
CAST
Convert data from one type to another.
-- Casting a column
p = """
SELECT
CAST(salary AS DECIMAL(10, 2))
FROM
employees;
"""
pd.read_sql_query(p, con)| salary |
|---|
| 60000.00 |
| 80000.00 |
| 55000.00 |
| 75000.00 |
| 65000.00 |
| 70000.00 |
| 62000.00 |
| 77000.00 |
| 53000.00 |
| 58000.00 |
| 69000.00 |
| 50000.00 |
| 64000.00 |
| 72000.00 |
| 68000.00 |
| 61000.00 |
| 76000.00 |
| 57000.00 |
| 63000.00 |
| 81000.00 |
Aggregations
Perform calculations on a set of values.
-- Using aggregation functions
p = """
SELECT
department_id,
COUNT(employee_id) AS num_employees
FROM
employees
GROUP BY
department_id;
"""
pd.read_sql_query(p, con)| department_id | num_employees |
|---|---|
| 1 | 7 |
| 2 | 7 |
| 3 | 6 |
GROUP BY and HAVING
Group rows that have the same values and filter groups.
-- Grouping rows and filtering groups
p = """
SELECT
department_id,
COUNT(employee_id) AS num_employees
FROM
employees
GROUP BY
department_id
HAVING COUNT(employee_id) > 5;
"""
pd.read_sql_query(p, con)| department_id | num_employees |
|---|---|
| 1 | 7 |
| 2 | 7 |
| 3 | 6 |
UNION, INTERSECT, MINUS
Combine result sets
Table: managers
| manager_id | first_name | last_name |
|---|---|---|
| 1 | Michael | Brown |
| 2 | Sarah | Johnson |
| 3 | John | Lee |
-- Union example
p = """
SELECT
first_name
FROM
employees
UNION
SELECT
first_name
FROM
managers;
"""
pd.read_sql_query(p, con)| first_name |
|---|
| Alice |
| Bob |
| Charlie |
| David |
| Eva |
| Frank |
| Grace |
| Henry |
| Irene |
| Jack |
| Karen |
| Leo |
| Mia |
| Noah |
| Olivia |
| Paul |
| Quinn |
| Rachel |
| Sam |
| Tina |
| Michael |
| Sarah |
| John |
POSITION
Find the position of a substring.
-- Finding substring position
p = """
SELECT
POSITION('e' IN first_name)
FROM
employees;
"""
pd.read_sql_query(p, con)| position |
|---|
| 0 |
| 0 |
| 4 |
| 0 |
| 0 |
| 2 |
| 0 |
| 3 |
| 2 |
| 0 |
| 0 |
| 3 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 2 |
| 0 |
| 0 |
CASE
Conditional logic in `SQL.
-- Using CASE statements
p = """
SELECT
first_name,
last_name,
CASE
WHEN salary > 60000 THEN 'High'
WHEN salary BETWEEN 40000 AND 60000 THEN 'Medium'
ELSE 'Low'
END AS salary_category
FROM
employees;
"""
pd.read_sql_query(p, con)| first_name | last_name | salary_category |
|---|---|---|
| Alice | Smith | Medium |
| Bob | Johnson | High |
| Charlie | Lee | Medium |
| David | Kim | High |
| Eva | Brown | Medium |
| Frank | Wilson | High |
| Grace | Taylor | Medium |
| Henry | Anderson | High |
| Irene | Thomas | Medium |
| Jack | White | Medium |
| Karen | Harris | High |
| Leo | Martin | Medium |
| Mia | Jackson | Medium |
| Noah | Lee | High |
| Olivia | Perez | High |
| Paul | Young | Medium |
| Quinn | King | High |
| Rachel | Scott | Medium |
| Sam | Green | Medium |
| Tina | Adams | High |
Joins
Introduction
It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
pandas has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.
# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt
# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")Prerequisites
This chapter will use the pandas data analysis package.
Concatenate
If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using pd.concat().
For the same columns, pass axis=0 to glue the index together; for the same index, pass axis=1 to glue the columns together. The concatenate function will typically be used on a list of data frames.
If you want to track where the original data came from in the final data frame, use the keys keyword.
Here’s an example using data on two different states’ populations that also makes uses of the keys option:
import pandas as pd
import urllib.request
base_url = "http://www.stata-press.com/data/r14/"
state_codes = ["ca", "il"]
end_url = "pop.dta"
headers = {'User-Agent': 'Mozilla/5.0'}
def fetch_data(url):
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
return pd.read_stata(response)
# This grabs the two data frames, one for each state
list_of_state_dfs = [fetch_data(base_url + state + end_url) for state in state_codes]
# Show example of first entry in list of data frames
print(list_of_state_dfs[0]) county pop
0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
# Concatenate the list of data frames
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)Note that the keys argument is optional, but is useful for keeping track of origin data frames within the merged data frame.
Exercise
Concatenate the follow two data frames:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])Merge
There are so many options for merging data frames using pd.merge(left, right, on=..., how=... that we won’t be able to cover them all here. The most important features are: the two data frames to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=, and how to do the merge (eg left, right, outer, inner) via how=. This diagram shows an example of a merge using keys from the left-hand data frame:
The how= keyword works in the following ways: - how='left' uses keys from the left data frame only to merge. - how='right' uses keys from the right data frame only to merge. - how='inner' uses keys that appear in both data frames to merge. - how='outer' uses the cartesian product of keys in both data frames to merge on.
Let’s see examples of some of these:
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")| key1 | key2 | A | B | C | D | |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
| 3 | K2 | K0 | NaN | NaN | C3 | D3 |
Note that the key combination of K2 and K0 did not exist in the left-hand data frame, and so its entries in the final data frame are NaNs. But it does have entries because we chose the keys from the right-hand data frame.
What about an inner merge?
pd.merge(left, right, on=["key1", "key2"], how="inner")| key1 | key2 | A | B | C | D | |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
Now we see that the combination K2 and K0 are excluded because they didn’t exist in the overlap of keys in both data frames.
Finally, let’s take a look at an outer merge that comes with some extra info via the indicator keyword:
pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)| key1 | key2 | A | B | C | D | _merge | |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
| 4 | K2 | K0 | NaN | NaN | C3 | D3 | right_only |
| 5 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
Now we can see that the products of all key combinations are here. The indicator=True option has caused an extra column to be added, called ’_merge’, that tells us which data frame the keys on that row came from.
Exercise
Merge the following two data frames using the left_on and right_on keyword arguments to specify a join on lkey and rkey respectively:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})Exercise
Merge the following two data frames on "a" using how="left" as a keyword argument:
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})What do you notice about the position .loc[1, "c"] in the merged data frame?
For more on the options for merging, see pandas’ comprehensive merging documentation.
Introduction to SQL Joins
SQL joins are used to combine rows from two or more tables based on a related column between them. Understanding joins is crucial for effective data retrieval and manipulation. This section will cover the four main types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN, using visual aids and code snippets with a fake data frame.
Types of Joins
INNER JOIN
An INNER JOIN returns only the rows that have matching values in both tables.

-- Inner join example
p = """
SELECT
A.id,
A.name,
B.order_id
FROM
Customers A
INNER JOIN
Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)LEFT JOIN
A LEFT JOIN returns all the rows from the left table and the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table.

-- Left join example
p = """
SELECT
A.id,
A.name,
B.order_id
FROM
Customers A
LEFT JOIN
Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)RIGHT JOIN
A RIGHT JOIN returns all the rows from the right table and the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table.

p = """
SELECT
A.id,
A.name,
B.order_id
FROM
Customers A
RIGHT JOIN
Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)FULL JOIN
A FULL JOIN returns all the rows when there is a match in either left or right table. Rows without a match in one of the tables will contain NULL values for columns from that table.

-- Full join example
p = """
SELECT
A.id,
A.name,
B.order_id
FROM
Customers A
FULL JOIN
Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)Example Data Frames
To illustrate these joins, let’s consider two fake data frames: Customers and Orders.
import pandas as pd
# Creating a fake data frame for Customers
customers = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})
# Creating a fake data frame for Orders
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'customer_id': [1, 2, 2, 4]
})
# Display the data frames
print("Customers Data Frame")
print(customers)Customers Data Frame
id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 David
print("Orders Data Frame")
print(orders)Orders Data Frame
order_id customer_id
0 101 1
1 102 2
2 103 2
3 104 4
Conclusion
SQL is an indispensable tool for anyone working with data. Its ability to manage, manipulate, and integrate data makes it essential for data analysis and decision-making. The examples provided in this chapter illustrate some of the key operations and techniques used in SQL, highlighting its importance in the field of data science.